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Abstract 

Determinantal point processes (DPPs) are point process models that naturally encode diver¬ 
sity between the points of a given realization, through a positive definite kernel K. DPPs possess 
desirable properties, such as exact sampling or analyticity of the moments, but learning the pa¬ 
rameters of kernel K through likelihood-based inference is not straightforward. First, the kernel 
that appears in the likelihood is not K, but another kernel L related to K through an often in¬ 
tractable spectral decomposition. This issue is typically bypassed in machine learning by directly 
parametrizing the kernel L, at the price of some interpretability of the model parameters. We 
follow this approach here. Second, the likelihood has an intractable normalizing constant, which 
takes the form of a large determinant in the case of a DPP over a finite set of objects, and the form 
of a Fredholm determinant in the case of a DPP over a continuous domain. Our main contribution 
is to derive bounds on the likelihood of a DPP, both for finite and continuous domains. Unlike 
previous work, our bounds are cheap to evaluate since they do not rely on approximating the 
spectrum of a large matrix or an operator. Through usual arguments, these bounds thus yield 
cheap variational inference and moderately expensive exact Markov chain Monte Carlo inference 
methods for DPPs. 


1 Introduction 

Determinantal point processes (DPPs) are point processes [1] that encode repulsiveness using algebraic 
arguments. They first appeared in [2], and have since then received much attention, as they arise in 
many fields, e.g. random matrix theory, combinatorics, quantum physics. We refer the reader to 
[3, 4, 5] for detailed tutorial reviews, respectively aimed at audiences of machine learners, statisticians, 
and probabilists. More recently, DPPs have been considered as a modelling tool, see e.g. [4, 3, 6]: DPPs 
appear to be a natural alternative to Poisson processes when realizations should exhibit repulsiveness. 
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In [3], for example, DPPs are used to model diversity among summary timelines in a large news corpus. 
In [7] , DPPs model diversity among the results of a search engine for a given query. In [4] , DPPs model 
the spatial repartition of trees in a forest, as similar trees compete for nutrients in the ground, and 
thus tend to grow away from each other. With these modelling applications comes the question of 
learning a DPP from data, either through a parametrized form [4, 7] , or non-parametrically [8, 9] . We 
focus in this paper on parametric inference. 

Similarly to the correlation between the function values in a Gaussian process [10], the repulsiveness 
in a DPP is defined through a kernel K, which measures how much two points in a realization repel 
each other. The likelihood of a DPP involves the evaluation and the spectral decomposition of an 
operator C defined through a kernel L that is related to K. There are two main issues that arise when 
performing likelihood-based inference for a DPP. First, the likelihood involves evaluating the kernel L, 
while it is more natural to parametrize K instead, and there is no easy link between the parameters 
of these two kernels. The second issue is that the spectral decomposition of the operator £ required 
in the likelihood evaluation is rarely available in practice, for computational or analytical reasons. For 
example, in the case of a large finite set of objects, as in the news corpus application [3], evaluating 
the likelihood once requires the eigendecomposition of a large matrix. Similarly, in the case of a 
continuous domain, as for the forest application [4], the spectral decomposition of the operator £ may 
not be analytically tractable for nontrivial choices of kernel L. In this paper, we focus on the second 
issue, i.e., we provide likelihood-based inference methods that assume the kernel L is parametrized, 
but that do not require any eigendecomposition, unlike [7]. More specifically, our main contribution 
is to provide bounds on the likelihood of a DPP that do not depend on the spectral decomposition of 
the operator £. For the finite case, we draw inspiration from bounds used for variational inference of 
Gaussian processes [11], and we extend these bounds to DPPs over continuous domains. 

For ease of presentation, we first consider DPPs over finite sets of objects in Section 2, and we 
derive bounds on the likelihood. In Section 3, we plug these bounds into known inference paradigms: 
variational inference and Markov chain Monte Garlo inference. In Section 4, we extend our results to 
the case of a DPP over a continuous domain. Readers who are only interested in the finite case, or 
who are unfamiliar with operator theory, can safely skip Section 4 without missing our main points. 
In Section 5, we experimentally validate our results, before discussing their breadth in Section 6. 


2 DPPs over finite sets 

2.1 Definition and likelihood 

Gonsider a discrete set of items y = {xi,..., x„}, where x^ C is a vector of attributes that describes 
item i. Let K he a. symmetric positive definite kernel [12] on and let K = ((i£(xi,x^))) be the 
Gram matrix of K. The DPP of kernel K is defined as the probability distribution over all possible 
2" subsets Y cy such that 

P(ylcr)=det(K^), (1) 

where denotes the sub-matrix of K indexed by the elements of A. This distribution exists and 
is unique if and only if the eigenvalues of K are in [0,1] [5]. Intuitively, we can think of iF(x,y) as 
encoding the amount of negative correlation, or “repulsiveness” between x and y. Indeed, as remarked 
in [3], (1) first yields that diagonal elements of K are marginal probabilities: P(xi G P) = Ku. 
Equation (1) then entails that x^ and Xj are likely to co-occur in a realization of Y if and only if 

detRTixi.xj} = K{y:i,x,)K{yi,yi) - R:(x,,Xj)^ = P(xi G F)P(xj G P) - Kfj 
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is large: off-diagonal terms in K indicate whether points tend to co-occur. 

Providing the eigenvalues of K are further restricted to be in [0,1), the DPP of kernel K has a 
likelihood [1]. More specifically, writing Yi for a realization of Y, 


P(y = Yi) 


det Lyj 
det(L-kl)’ 


( 2 ) 


where L = (I — K)~^K, I is the n x n identity matrix, and denotes the sub-matrix of L indexed 
by the elements of Yi. Now, given a realization Yi , we would like to infer the parameters of kernel K, 
say the parameters 9k = {a.K,cFK) G (0,oo)^ of a squared exponential kernel [10] 


K{Ti,y) = aifexp 


l|x-y|p \ 

2(t\ ) 


Since the trace of K is the expected number of points in Y [-j] , one can estimate ax by the number 
of points in the data divided by n [4]. But aK, the parameter governing the repulsiveness, has to be 
fitted. If the number of items n is large, likelihood-based methods such as maximum likelihood are 
too costly: each evaluation of (2) requires 0{n?') storage and 0{n^) time. Furthermore, valid choices 
of 9k are constrained, since one needs to make sure the eigenvalues of K remain in [0,1). 

A partial work-around is to note that given any symmetric positive definite kernel L, the likelihood 
(2) with matrix L = ((L(xi,Xj))) corresponds to a valid choice of K, since the corresponding matrix 
K = L(I -I- L)“^ necessarily has eigenvalues in [0,1], which makes sure the DPP exists [5]. The 
work-around consists in directly parametrizing and inferring the kernel L instead of K, so that the 
numerator of (2) is cheap to evaluate, and parameters are less constrained. Note that this step favours 
tractability over interpretability of the inferred parameters: if we assume 


A(x,y) = o^exp 


l|x-yf\ 

J’ 


the number of points and the repulsiveness of the points in Y do not decouple as nicely as when K 
is squared exponential. For example, the expected number of items in Y depends on ol and ctl now, 
and both parameters also affect repulsiveness. There is some work investigating approximations to K 
to retain the more interpretable parametrization [4], but the machine learning literature [3, 7] almost 
exclusively adopts the more tractable parametrization of L. In this paper, we also make this choice of 
parametrizing L directly. 

Now, the only expensive step in the evaluation of (2) is the computation of det(L -|- I). While 
this still prevents the application of maximum likelihood, bounds on this determinant can be used 
in a variational approach or an MCMC algorithm, for example. In [7], bounds on det(L -|- I) are 
proposed, requiring only the first m eigenvalues of L, where m is chosen adaptively at each MCMC 
iteration to make the acceptance decision possible. This still requires the application of power iteration 
methods, which are limited to the finite domain case, require storing the whole n x n matrix L, and 
are prohibitively slow when the number of required eigenvalues m is large. 


2.2 Nonspectral bounds on the likelihood 

Let us denote by L_ 4 b the submatrix of L where row indices correspond to the elements of A, and 
column indices to those of B. When A = B, we simply write for s-nd we drop the subscript 

when A = y. Drawing inspiration from sparse approximations to Gaussian processes using inducing 
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variables [11], we let Z = {zi,... ,Zm} be an arbitrary set of points in and we approximate L by 
Q = Lyz[Lzj~^Lzy- Note that we do not constrain Z to belong to y, so that our bounds do not rely 
on a Nystrom-type approximation [13]. We term Z “pseudo-inputs”, or “inducing inputs”. 


Proposition 1. 


1 


det(Q -1-1) 


„-tr(L-Q) 


< 


< 


det(L-l-I) det(Q-l-I) 


(3) 


Proof. The right inequality is a straightforward consequence of the Schur complement L — Q being 
positive semidefinite. For instance, once could remark that for v G K", 

v^Lv < 1 v^Qv < 1, 


so that 

/ 1{vT(L+I)v<1}C?V < / 1{vT(Q+I)v<1}'^V. 

jR" JR" 

A change of variables using the Cholesky decompositions of L-l-I and Q+I yields the desired inequality. 

The left inequality in (3) can be proved along the lines of [11], using variational arguments (as 
will be discussed in detail in Section 3.1.1). Next we give an alternative, more direct proof based on 
an inequality on determinants [14, Theorem 1]. For any real symmetric matrix A = Pdiag(Ai)P^, 
define its absolute value as |A| = PdiagdAiDP"^. In particular, for a positive semidefinite A, \A\ = A. 
Applying [14, Theorem 1] and noting that L, Q and L — Q are positive semidefinite, it comes 


det(L-l-I) = det(L — Q -I- Q + 1) 

< det(|L-Q|-kI)det(|Q|+I) 

= det(L-Q-kI)det(Q-kI) (4) 

Now, denote by A^ the eigenvalues of L — Q, which are all nonnegative. It comes 

n n 

det(L - Q -p I) = J|(l + A) < n (5) 

z=l i=l 


where we used the inequality 1 -P x < Plugging (5) into (4) yields the left part of (3). 


□ 


3 Learning a DPP using bounds 

In this section, we explain how to run variational inference and Markov chain Monte Carlo methods 
using the bounds in Proposition 1. In this section, we also make connections with variational sparse 
Gaussian processes more explicit. 

3.1 Variational inference 

The lower bound in Proposition 1 can be used for variational inference. Assume we have T point 
process realizations Yi,...,1t) and we fit a DPP with kernel L = Lg. The log likelihood can be 
expressed using (2) 

T 

£(0) = ^logdet(LvJ -riogdet(L-Pl). (6) 

2=1 
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Let Z be an arbitrary set of m points in Proposition 1 then yields a lower bound 

T 

^logdet(LrJ-Tlogdet(Q + I) + rtr(L-Q) <£(6<). (7) 

t=i 

The lower bound J-{6, Z) can be computed efficiently in 0{nm?) time. Instead of maximizing £{0), we 
can maximize J-{0, Z) jointly with respect to the kernel parameters 6 and the variational parameters 
Z. 

To maximize (7), one can e.g. implement an EM-like scheme, alternately optimizing in Z and 
Q. Kernels are often differentiable with respect to 0, and sometimes T will also be differentiable 
with respect to the pseudo-inputs Z, so that gradient-based methods can help. In the general case, 
black-box optimizers such as CMA-ES [15], can also be employed. 

3.1.1 Connections with variational inference in sparse GPs 

We now discuss the connection between the variational lower bound in Proposition 1 and the variational 
lower bound used in sparse Gaussian process (GP; [10]) models [11]. Although we do not explore this 
in the current paper, this connection could extend the repertoire of variational inference algorithms 
for DPPs by including, for instance, stochastic optimization variants. 

Assume function / follows a GP distribution with zero mean function and kernel function L, so 
that the vector f of function values evaluated at y follows the Gaussian distribution A/'(f [0, L). Then, 
the standard Gaussian integral yields^ 

( 8 ) 

Following this interpretation, we augment the vector f with a vector of m extra auxiliary function 
values u, referred to as inducing variables, evaluated at the inducing points Z so that jointly (f,u) 
follows 

L?)). 

= M{i\Lyz[Lz]-^u,L- Q)Af(ul0,L2). (9) 

Now by using the fact that A/’(flO,L) = f p({,u)du, the integral in (8) can be expanded so that 

(^JAf{{\Lyz[Lz]-^u,L-Q)Af{u\0,Lz)e~^^^U{duj . ( 10 ) 

We can bound the above integral using Jensen’s inequality and the variational distribution g(f, u) = 
Af{f\Lyz[hz]~^u,L - Q)g(u), where ^(u) is a marginal variational distribution over the inducing 
variables u. This form of variational distribution is exactly the one used for sparse GPs [11], and by 
treating the factor ^(u) optimally we can recover the left lower bound in Proposition 1, following the 
lines of [11]. We provide details in Appendix A. 

The above connection suggests that much of the technology developed for speeding up GPs can 
be transferred to DPPs. For instance, if we explicitly represent the ( 7 (u) variational distribution in 

iNotice that J WCf |0, L)e" = (2w)"/^ f Wlf |0, LjATjOlf, I)df = (27r)"/2At(0|0, L-(-1), and (8) follows. 
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the above formulation, then we can develop stochastic variational inference variants for learning DPPs 
based on data subsampling [16]. In other words, we can apply to DPPs stochastic variational inference 
algorithms for sparse GPs such as [17]. 


3.2 Markov chain Monte Carlo inference 

If approximate inference is not suitable, we can use the bounds in Proposition 1 to build a more 
expensive Markov chain Monte Carlo [18] sampler. Given a prior distribution p{6) on the parameters 
0 of L, Bayesian inference relies on the posterior distribution 7r(0) oc exp(£(6>))p(6>), where the log 
likelihood l{9) is defined in (6). A standard approach to sample approximately from 7r(6*) is the 
Metropolis-Hastings algorithm (MH; [18, Chapter 7.3]). MH consists in building an ergodic Markov 
chain of invariant distribution 7r(0). Given a proposal q{6'\9), the MH algorithm starts its chain at a 
user-defined 9q, then at iteration /c -|- 1 it proposes a candidate state 9' ~ q{-\9k) and sets 9k+i to 9' 
with probability 


a{9k,9') 


min 


q{9k\9') 
’ q{9'\9k) 


( 11 ) 


while 9k+i is otherwise set to 9k. The core of the algorithm is thus to draw a Bernoulli variable 
with parameter a = a{9,9') for 9,9' G This is typically implemented by drawing a uniform 
u ^ 7/[op] and checking whether u < a. In our DPP application, we cannot evaluate a. But we can use 
Proposition 1 to build a lower and an upper bound £(0) G \h_{9, Z) ,b+ (9, Zy\, which can be arbitrarily 
refined by increasing the cardinality of Z and optimizing over Z. We can thus build a lower and upper 
bound for a 


h.{9',Z')-b+{9,Z) + \og 


' p{o'r 

_p{9) _ 


< logo < b^{9',Z') 


b-{9,Z) +log 


' pjer 

_p{9)_ ■ 


( 12 ) 


Now, another way to draw a Bernoulli variable with parameter a is to first draw u ~ 77[o,i]i then 
refine the bounds in (12), by augmenting the numbers |2^|, of inducing variables and optimizing 
over Z, Z', until logit is out of the interval formed by the bounds in (12). Then one can decide whether 
u < a. This Bernoulli trick is sometimes named retrospective sampling and has been suggested as early 
as [19]. It has been used within MH for inference on DPPs with spectral bounds in [7], we simply 
adapt it to our non-spectral bounds. 


4 The case of continuous DPPs 

DPPs can be defined over very general spaces [5] . We limit ourselves here to point processes on A C 
such that one can extend the notion of likelihood. In particular, we define here a DPP on X as in 
[1, Example 5.4(c)], by defining its Janossy density. For definitions of traces and determinants of 
operators, we follow [20, Section VH]. 

4.1 Definition 

Let /i be a measure on that is continuous with respect to the Lebesgue measure, with 

density p!. Let L be a symmetric positive definite kernel. L defines a self-adjoint operator on L^{p) 
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through £(/) = /L(x,y)/(y)d/r(y). Assume C is trace-class, and 


tr(£) = j L(x,x)d//(x). (13) 

Jx 

We assume (13) to avoid technicalities. Proving (13) can be done by requiring various assumptions 
on L and /i. Under the assumptions of Mercer’s theorem, for instance, (13) will be satisfied [20, 
Section VII, Theorem 2.3]. More generally, the assumptions of [21, Theorem 2.12] apply to kernels 
over noncompact domains, in particular the Gaussian kernel with Gaussian base measure that is often 
used in practice. We denote by Xi the eigenvalues of the compact operator £. There exists [1, Example 
5.4(c)] a simple^ point process on such that 


/There are n particles, one in each of 
the infinitesimal balls dxi) 


det((L(x,,Xj-)) 
det(I -I- £) 


n'ixi). 


(14) 


where B{x,r) is the open ball of center x and radius r, and where det(I -|- £) A + 1) is the 

Fredholm determinant of operator £ [20, Section VII]. Such a process is called the determinantal point 
process associated to kernel L and base measure Equation (14) is the continuous equivalent of 
(2). Our bounds require tl/ to be computable. This is the case for the popular Gaussian kernel with 
Gaussian base measure. 


4.2 Nonspectral bounds on the likelihood 

In this section, we derive bounds on the likelihood (14) that do not require to compute the Fredholm 
determinant det(I -I- £). 

Proposition 2. Let Z = {zi, ... ,Zm} C then 

__detL 2 _^ - f L(x,x)dfi(x)+tr(L~^'i’) ^ ^ det L^; 

det(L 2 -k’®') - det(I-k£) “ det(LzwW’ ^ ^ 

where = ((£(zi,Zj)) and 4'^ = j £(zi, x)£(x, Zj)d/J,(x). 

To see how similar (15) is to (3), we define Qz to be the operator on L^(/i) associated to the kernel 


Q2(x,x')=£(x,Z)L2^£(Z,x'), (16) 

where 

L(x, Z) = £(Z,x)^ = (£(x,zi) ... L(x,Zm)). 

Then the following extension of the matrix-determinant lemma shows that the common factor in the 
left and right hand side of (15) is the inverse of det(I -|- Qz), as in (3). 

Lemma 4.1. With the notation of Section 4, it holds 


det(I -I- Qz) = 


det(L2 -I- fit) 
detL^: 


^i.e., for which all points in a realization are distinct. 

^There is a notion of kernel K for general DPPs [o], but we define L directly here, for the sake of simplicity. The 
interpretability issues of using L instead of K are the same as for the finite case, see Sections 2 and 5. 
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Proof. First note that Qz has finite rank since for / G 

M 

Qzf= ^ [L-i]„„L(z„.) 

u,v—l 


J L{ 2 y,y)f{y)d^i{y) G S 


with 

S = Span (L(zi, •); 1 ^ ^ M). 

Note also that the L(zi,-)’s are linearly independent since L is a positive definite kernel. Now let 
{4'i)i<i<M be an orthonormal basis of S, i.e. Span(0i; 1 < i < M) = S and 


^ij : 


and define the matrix W by 

where (•,•) denotes the inner product of L^{y). By definition of the Fredholm determinant for finite 
rank operators [20, Section VII. 1 or Theorem VII.3.2], it comes 

det(J+ Qz) = det {{Sjk + 0i)))i<,j<„- 

Since 

M 

^ ^ mi\^z ]fn'n 


it comes 

det(J + Qz) = det(I + W^L^^W). 
Applying the classical matrix determinant lemma, it comes 

det(Lz + WW^) 


det(T + Qz) = 


detL; 


We finally remark that 


M 


[WW^],, = ^(L(z„.),0fc)(L(z„.),0fe) 


/ M 


= {'^{L{2„-),(l)k)4'k, L{zj,-) 


U=1 




□ 


Proof, (of Proposition 2) We first prove the right inequality in (15). From [20, Section VII.7], using 
(13), it holds 

oo 1 

det{I + C) = 1 + X! w / bet ((L(xi,xj)))d/i(xi).. .(i/r(xfc). 

Z-1 ’ 'J 


( 17 ) 



We now apply the same argument as in the proof of the finite case (proof of Proposition 1). Denoting 
Lxy = det((L(xi, Yi))) and Lx = ((det L(xi, x^-)), we know from the positive definiteness of the kernel 
L that hx — is positive semidefinite, which yields 

detL;f > det'Lxz^‘z^^‘ZX- 

Plugging this into (17) yields the right inequality in (15). 

Upon noting that 


tr(Qz) = z«)i(z«, x)d^(x) = tr(L 2 ^’®'), 

mn 

the proof of the left inequality in (15) follows the lines of the proof of Proposition 1, since the main 
tool [14, Theorem 1] is valid for any trace-class operators. □ 

5 Experiments 

5.1 A toy Gaussian continuous experiment 

In this section, we consider a DPP on K, so that the bounds derived in Section 4 apply. As in [7, 
Section 5.1], we take the base measure to be proportional to a Gaussian, i.e. its density is = 

nN(x\Q,{2a)~'^). We consider a squared exponential kernel L{x,y) = exp (—e^jja; — yjp). In this 
particular case, the spectral decomposition of operator C is known [22]"^: the eigenfunctions of C are 
scaled Hermite polynomials, while the eigenvalues are a geometrically decreasing sequence. This ID 
Gaussian-Gaussian example is interesting for two reasons: first, the spectral decomposition of C is 
known, so that we can sample exactly from the corresponding DPP [5] and thus generate synthetic 
datasets. Second, the Fredholm determinant det(T-|-£) in this special case is a q-Pochhammer symbol, 
and can thus be efficiently computed®, which allows for comparison with “ideal” likelihood-based 
methods, to check the validity of our MGMG sampler, for instance. We emphasize that these special 
properties are not needed for the inference methods in Section 3, they are simply useful to demonstrate 
their correctness. 

We sample a synthetic dataset using (k, a,e) = (1000,0.5,1), resulting in 13 points shown in red 
in Figure 1(a). Applying the variational inference method of Section 3.1, jointly optimizing in Z 
and 9 = (k, a, e) using the GMA-ES optimizer [15], yields poorly consistent results: k varies over 
several orders of magnitude from one run to the other, and relative errors for a and e go up to 100% 
(not shown). We thus investigate the identifiability of the parameters with the retrospective MH of 
Section 3.2. To limit the range of k, we choose for (log k, log a, log e) a wide uniform prior over 

[200,2000] X [-10,10] X [-10,10]. 

We use a Gaussian proposal, the covariance matrix of which is adapted on-the-fiy [23] so as to reach 
25% of acceptance. We start each iteration with to = 20 pseudo-inputs, and increase it by 10 and re¬ 
optimize when the acceptance decision cannot be made. Most iterations could be made with to = 20, 
and the maximum number of inducing inputs required in our run was 80. We show the results of a run 
of length 10 000 in Figure 5.1. Removing a burn-in sample of size 1000, we show the resulting marginal 

^We follow the parametrization of [22] for ease of reference. 

®http: //docs. sympy. org/latest/modules/mpmath/functions/qfmictions .htnil#q-pochheunmer-symbol 
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Figure 1: Results of running adaptive Metropolis-Hastings in the ID Gaussian continuous experiment 
of Section 5.1. Figure 1(a) shows data in red, a set of optimized pseudo-inputs in black for 9 set to the 
value used in the generation of the synthetic dataset, and the marginal of one point in the realization 
in blue. Figures 1(b), 1(c), and 1(d) show marginal histograms of K,a,e. 


histograms in Figures 1(b), 1(c), and 1(d). Retrospective MH and the ideal MH agree. The prior pdf 
is in green. The posterior marginals of a and e are centered around the values used for simulation, 
and are very different from the prior, showing that the likelihood contains information about a and e. 
However, as expected, almost nothing is learnt about a, as posterior and prior roughly coincide. This 
is an example of the issues that come with parametrizing L directly, as mentioned in Section 1. 

To conclude, we show a set of optimized pseudo-inputs Z in black in Figure 1(a). We also su¬ 
perimpose the marginal of any single point in the realization, which is available through the spectral 
decomposition of C here [5]. In this particular case, this marginal is a Gaussian. Interestingly, the 
pseudo-inputs look like evenly spread samples from this marginal. Intuitively, they are likely to make 
the denominator in the likelihood (14) small, as they represent an ideal sample of the Gaussian- 
Gaussian DPP. 

5.2 Diabetic neuropathy dataset 

Here, we consider a real dataset of spatial patterns of nerve fibers in diabetic patients. These nerve 
fibers become more clustered as diabetes progresses [24]. The dataset consists of seven samples collected 
from diabetic patients at different stages of diabetic neuropathy and one healthy subject. We follow 
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the experimental setup used in [7] and we split the total samples into two classes: Normal/Mildly 
Diabetic and Moderately/Severely Diabetic. The first class contains three samples and the second one 
the remaining four. Figure 2 displays the point process data, which contain on average 90 points per 
sample in the Normal/Mildly class and 67 for the Moderately/Severely class. We wish to investigate 
the differences between these classes by fitting a separate DPP to each class and then quantify the 
differences of the repulsion or overdispersion of the point process data through the inferred kernel 
parameters. Paraphrasing [7], we consider a continuous DPP on with kernel function 


L(xi,Xj) = exp 


E 




^j,d) 


(18) 


and base measure proportional to a Gaussian /r'(x) = «:rifciPd)- [^]i quantify the 

overdispersion of realizations of such a Gaussian-Gaussian DPP through the quantities -jd = crdipd, 
which are invariant to the scaling of x. Note however that, strictly speaking, k also mildly influences 
repulsion. 

We investigate the ability of the variational method in Section 3.1 to perform approximate maximum 
likelihood training over the kernel parameters 0 = (k, cti, CT 2 , pi, P 2 )- Specifically, we wish to fit a 
separate continuous DPP to each class by jointly maximizing the variational lower bound over 9 and 
the inducing inputs Z using gradient-based optimization. Given that the number of inducing variables 
determines the amount of the approximation, or compression of the DPP model, we examine different 
settings for this number and see whether the corresponding trained models provide similar estimates 
for the overdispersion measures. Thus, we train the DPPs under different approximations having 
m e {50,100,200,400, 800,1200} inducing variables and display the estimated overdispersion measures 
in Figures 3(a) and 3(b). These estimated measures converge to coherent values as m increases. They 
show a clear separation between the two classes, as also found in [7, 24]. Furthermore, Figures 3(c) and 
3(d) show the values of the upper and lower bounds on the log likelihood, which as expected, converge 
to the same limit as m increases. We point out that the overall optimization of the variational lower 
is relatively fast in our MATLAB implementation. For instance, it takes 24 minutes for the most 
expensive run where m = 1200 to perform 1000 iterations until convergence. Smaller values of m yield 
significantly smaller times. 

Finally, as in Section 5.1, we comment on the optimized pseudo-inputs. Figure 4 displays the 
inducing points at the end of a converged run of variational inference for various values of m. Similarly 
to Figure 1(a), these pseudo-inputs are placed in remarkably neat grids and depart significantly from 
their initial locations. 


normal mild1 mild2 modi mod2 sevi 



Figure 2: Six out of the seven nerve fiber samples. The first three samples (from left to right) 
correspond to a Normal Subject and two Mildly Diabetic Subjects, respectively. The remaining three 
samples correspond to a Moderately Diabetic Subject and two Severely Diabetic Subjects. 
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Figure 3: Figures 3(a) and 3(b) show the evolution of the estimated overdispersion measures 71 
and 72 as functions of the number of inducing variables used. The dotted black lines correspond 
to the Normal/Mildly Diabetic class while the solid lines to the Moderately/Severely Diabetic class. 
Figure 3(c) shows the upper bound (red) and the lower bound (blue) on the log likelihood as functions of 
the number of inducing variables for the Normal/Mildly Diabetic class while the Moderately/Severely 
Diabetic case is shown in Figure 3(d). 



Figure 4: We illustrate the optimization over the inducing inputs Z for several values of m G 
{50,100,200,400,800,1200} in the DPP of Section 5.2. We consider the Normal/Mildly diabetic 
class. The panels in the top row show the initial inducing input locations for various values of to, while 
the corresponding panels in the bottom row show the optimized locations. 

6 Discussion 

We have proposed novel, cheap-to-evaluate, nonspectral bounds on the determinants arising in the 
likelihoods of DPPs, both finite and continuous. We have shown how to use these bounds to infer 
the parameters of a DPP, and demonstrated their use for expensive-but-exact MCMC and cheap-but- 
approximate variational inference. In particular, these bounds have some degree of freedom - the 
pseudo-inputs which we optimize so as to tighten the bounds. This optimization step is crucial for 
likelihood-based inference of parametric DPP models, where bounds have to adapt to the point where 
the likelihood is evaluated to yield decisions which are consistent with the ideal underlying algorithms. 
In future work, we plan to investigate connections of our bounds with the quadrature-based bounds 
for Fredholm determinants of [25] . We also plan to consider variants of DPPs that condition on the 
number of points in the realization, to put joint priors over the within-class distributions of the features 
in classification problems, in a manner related to [ 6 ]. In the long term, we will investigate connections 
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between kernels L and K that could be made without spectral knowledge, to address the issue of 
replacing L by K. 
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A On the connection to variational sparse GPs 

Here, we provide further details about how the variational lower bound for DPPs over finite sets 
in Proposition 1 can be obtained by the variational approach to sparse GPs [11]. As mentioned in 
Section 3.1.1, it holds that 

det(L + I) ^ (/- Q)A/'(u|0,L2)e"5‘''^‘'dfdu^ . 

Taking logarithms yields 

log = 21og j Q)A/'(u|0,L2)e"5f'^fdfdu. 

A lower bound to the likelihood (2) can thus be obtained if we bound 

= log J A/’(f|Lj; 2 [L 2 ]“^u,L - Q)A/’(u| 0 ,L 2 )e“ 2 ^^^dfdu. 

This has a similar functional form with the marginal likelihood in a standard GP regression model: 
e“ 2 ^ ^ plays the role of an unnormalized Gaussian likelihood where the observation vector is equal to 
zero and the noise variance is equal to one. To lower bound the above we can consider the variational 
distribution ( 7 (f,u) = A/’(f|Lj; 2 :[L 2 ]“^u, L — Q)g(u) and apply Jensen’s inequality so that 

-A > yA/'(f|L 3 ; 2 [L 2 ]“^u,L - Q)g(u) log -dfdu, 

where the term A/'(f|L 3 ; 2 :[L 2 ]“^u, L — Q) cancels out inside the logarithm. This can be written as 

-A>y g(u)|-^ yA/'(f|L 3 ; 2 [L 2 ]"iu,L - Q)f^fdf Tlogl^^^^^^^y^l du. 

Further, given that 


M{i\Lyz[Lz]-^u,L- Q)Ffdf = a^a + tr(L - Q), 


where a = hyzl^tz] the bound can be written as 


-A > / q{u) log 


A/'(u|0 ,Lg)e 2 ' 

9(u) 


-du — ~ Q)- 
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Now if we analytically maximize w.r.t. q(u), under the constraint that q(u) is a distribution, we obtain 


9(u) 


N{\i\Q,'Lz)e 2 “^“ 

/ A/’(u|0, 


Plugging this optimal q back into the bound, we obtain 


-F>log J Af(u|0,L2)e-5“'^“du- itr(L-Q). 

After computing the Gaussian integral w.r.t. u, the r.h.s. reduces to the logarithm of the DPP bound 
for the finite case, see Proposition 1. 


B ^ matrix for Gaussian kernels 


We give here more details on the Gaussian kernel with Gaussian base measure used in the experimental 
Section 5. We use the notation of Section 5.2. The kernel is 


L{xi,Xj) = e 


Z^d=l 


with Gaussian base measure having density 


D 


d^l 


1 „ 2 f^d) 

, e 


In this Gaussian-Gaussian case, the dt matrix defined in Proposition 2 can be analytically computed: 
the ij-th element is given by 


[^]b = / L{z,,x)L{x,Zj)dfi{x) = kT[ -j-, (19) 

where Zd = 
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